1 Fitting Distributions

The next step is to fit \(RI_{Q}\) to statistical distributions and see whether using the alternative distributions improves the scores. The selected alternative distributions are Logistic, Log Normal, and Gamma. The MASS package in R was used with maximum likelihood parameter estimation. Each score is calculated as 2\(*\)lower tail (if below the median) or 2\(*\)upper tail (if above the median).
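A minimal sketch of this fitting and scoring step, assuming `ri_q` holds the observed \(RI_Q\) values; the variable names and simulated data below are illustrative, not taken from the analysis (the gamma fit is omitted here because it is the one reported to behave poorly):

```r
library(MASS)

set.seed(1)
ri_q <- rnorm(200, mean = 1500, sd = 25)  # stand-in for observed RI_Q values

# Maximum likelihood fits for three of the candidate distributions
fit_norm  <- fitdistr(ri_q, "normal")
fit_logis <- fitdistr(ri_q, "logistic")
fit_lnorm <- fitdistr(ri_q, "lognormal")

# Two-tailed score under the fitted normal: 2 * lower tail below the
# median, 2 * upper tail above it, so scores fall in (0, 1]
two_tail_score <- function(x, mean, sd) {
  p <- pnorm(x, mean, sd)
  ifelse(p <= 0.5, 2 * p, 2 * (1 - p))
}

scores <- two_tail_score(ri_q, fit_norm$estimate["mean"],
                         fit_norm$estimate["sd"])
```

For the other distributions the same two-tailed construction applies with `plogis`, `plnorm`, or `pgamma` in place of `pnorm`, using each fit's estimated parameters.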

1.1 Parameter Fits

The next step is to quantify the variation in the parameter estimates obtained when fitting \(RI_Q\) to each distribution.

Figure 1: Maximum likelihood parameter estimates for each of the four test distributions (top row: gamma, second row: log normal, third row: logistic, last row: normal), where the point indicates the estimate and the lines indicate +/- 1 standard error, with the exception of gamma, for which both of these values are log transformed. Red indicates a point where the standard error of the estimate could not be calculated.

Parameter estimates for the normal, log normal, and logistic distributions are reasonable, with relatively small standard errors (Fig. 1). Gamma distribution estimates have high or NA standard errors, which is indicative of a poor fit.

1.2 Kolmogorov-Smirnov Statistic

The Kolmogorov-Smirnov (KS) test indicates how close the cumulative distribution functions of two distributions are (in this case, the actual \(RI_Q\) and its estimated distribution) by taking the maximum distance between the two curves.
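A minimal sketch of the KS comparison for the normal case, assuming `ri_q` holds the observed \(RI_Q\) values; the simulated data are illustrative only:

```r
set.seed(1)
ri_q <- rnorm(200, mean = 1500, sd = 25)  # stand-in for observed RI_Q values

# Normal MLE: the mean is the sample mean, and the MLE sd uses the
# 1/n denominator
mu    <- mean(ri_q)
sigma <- sqrt(mean((ri_q - mu)^2))

# KS statistic: maximum distance between the empirical CDF of ri_q and
# the fitted normal CDF
ks <- ks.test(ri_q, "pnorm", mean = mu, sd = sigma)
ks$statistic
```

One caveat on the design: when the reference distribution's parameters are estimated from the same data being tested, the p-values reported by `ks.test` are optimistic, so the statistic itself is the more reliable quantity for comparing candidate distributions.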

Figure 2: The Kolmogorov-Smirnov (KS) statistic of the distance between the query retention index \(RI_Q\) distribution and a modeled distribution using maximum likelihood estimation. The “best” label refers to the distribution with the lowest KS statistic, and the “within 0.05” label refers to a distribution whose p-value is within 0.05 of the best distribution.

The three non-normal distributions tend to have lower KS statistics than the normal distribution (Fig. 2), which may mean these distributions fit the true retention index query distribution better than the normal distribution. There is evidence to suggest that using the kernels of these distributions as an RI score may lead to improved ranks compared to adjusting the normal distribution score.

## # A tibble: 5 × 2
##   Distribution Count
##   <chr>        <int>
## 1 Logistic        64
## 2 Log Normal      12
## 3 Gamma           10
## 4 Original         1
## 5 Normal           0

1.3 Akaike Information Criterion

The Akaike Information Criterion (AIC) estimates the prediction error and therefore the relative quality of the distributional fit. Lower AICs indicate better fits. AICs were scaled within each row by subtracting the minimum AIC of that row from each value.
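A minimal sketch of the per-metabolite scaling, assuming each row of `aics` holds the AICs of the four fitted distributions for one metabolite; the metabolite names and numbers are illustrative:

```r
# Hypothetical AIC values: one row per metabolite, one column per
# candidate distribution
aics <- rbind(
  m1 = c(gamma = 512.3, lognormal = 510.8, logistic = 509.1, normal = 511.0),
  m2 = c(gamma = 430.2, lognormal = 428.9, logistic = 429.5, normal = 428.7)
)

# Subtract each row's minimum: the best-fitting distribution becomes 0,
# and the remaining entries show their distance from the best fit
scaled <- aics - apply(aics, 1, min)
```

In practice the raw AICs can be obtained directly from the fits, since `stats::AIC()` accepts the objects returned by `MASS::fitdistr()`.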

Figure 3: The Akaike Information Criterion (AIC) statistic of the distance between the \(RI_Q\) distribution and an estimate of that distribution using maximum likelihood estimation. AIC has been scaled to the minimum AIC per metabolite and then binned as listed in the legend above. Gray indicates a distribution for which the estimated parameters failed to calculate.

The AIC statistic also provides evidence that other distributions may fit \(RI_Q\) better than the normal (Fig. 3). According to the AIC statistic, the normal may be an appropriate estimate more often than the KS statistic suggests. The most interesting finding from the AIC statistic is that the logistic fit is rarely far removed from being the best fit.

2 Rank Changes

Figure 4: The proportion of identifications at rank 1 for each scoring method. Gamma, Log Normal, Logistic, Normal Kernel (original RI score), and Normal Probability results are calculated with the 10% holdout analysis.

Overall, retention index scores that use values from each \(RI_Q\) distribution tend to perform better than the original score for our holdout analysis on the full dataset for true positives at rank 1 (Fig. 4).

Figure 5: Proportions of true positives (left two) and true negatives (right two) at ranks 1 and 5 per metabolite, following the 10% holdout analysis with all input data.

We see that the other scores tend to outperform the original at ranks 1 and 5 for both true positives and true negatives (Fig. 5).

4 Holdout with Subset Only

The final question relates to whether the results change if we do a holdout analysis with only our subset metabolites and all their true positive, true negative, and unknown annotations.

Figure 9: Proportions of true positives (left two) and true negatives (right two) at ranks 1 and 5 per metabolite, following the 10% holdout analysis with only metabolites within our subset.

If we reduce the holdout analysis to just our subset of compounds, the other scores still outperform the original (Fig. 9).

5 Publication Figures and Tables

## # A tibble: 4 × 8
## # Groups:   Truth [2]
##   Truth          Rank Original `Original Adjusted`  Gamma Log N…¹ Logis…² Normal
##   <chr>         <dbl>    <dbl>               <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
## 1 True Positive     1   0.0508              0.214  0.215   0.213   0.242  0.213 
## 2 True Positive     5   0.370               0.742  0.742   0.740   0.700  0.741 
## 3 True Negative     1   0.0084              0.0088 0.0086  0.0087  0.0092 0.0087
## 4 True Negative     5   0.0793              0.055  0.0558  0.0548  0.063  0.0546
## # … with abbreviated variable names ¹​`Log Normal`, ²​Logistic

## # A tibble: 4 × 8
## # Groups:   Truth [2]
##   Truth          Rank Original `Original Adjusted`  Gamma Log N…¹ Logis…² Normal
##   <chr>         <dbl>    <dbl>               <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
## 1 True Positive     1    0.487               0.785 0.784   0.784   0.810  0.785 
## 2 True Positive     5    0.896               0.999 0.999   0.999   0.998  0.999 
## 3 True Negative     1    0.129               0.101 0.0704  0.0694  0.0859 0.0696
## 4 True Negative     5    0.764               0.749 0.744   0.744   0.755  0.744 
## # … with abbreviated variable names ¹​`Log Normal`, ²​Logistic

6 Poster Figures